Document Similarity in Repeatedly Translated Corpora

Authors

  • Vladimir Mateljan
  • Vedran Juričić
  • Dario Ogrizović
Abstract

Preliminary communication. The paper analyzes the changes in the relationships between documents in a textual corpus that occur when the corpus is translated into another language. The authors analyzed the similarities between documents in the original corpus, in Croatian, and compared them with the similarities between the corresponding documents in the translated corpus, in English. The changes were analyzed using two measures: the P-value of the chi-square test and a newly proposed measure, the correction coefficient.

Similar resources

Arabic News Articles Classification Using Vectorized-Cosine Based on Seed Documents

Besides its own merits, text classification (TC) has become a cornerstone in many applications. Work presented here is part of and a pre-requisite for a project we have undertaken to create a corpus for the Arabic text process. It is an attempt to create modules automatically that would help speed up the process of classification for any text categorization task. It also serves as a tool for...

A Minimally Supervised Approach for Detecting and Ranking Document Translation Pairs

We describe an approach for generating a ranked list of candidate document translation pairs without the use of bilingual dictionary or machine translation system. We developed this approach as an initial, filtering step, for extracting parallel text from large, multilingual—but non-parallel— corpora. We represent bilingual documents in a vector space whose basis vectors are the overlapping tok...

A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection

In this paper we describe our effort to create a dataset for the evaluation of cross-language textual similarity detection. We present preexisting corpora and their limits and we explain the various gathered resources to overcome these limits and build our enriched dataset. The proposed dataset is multilingual, includes cross-language alignment for different granularities (from chunk to documen...

Improving Statistical Machine Translation Using Comparable Corpora

Title of dissertation: Improving Statistical Machine Translation Using Comparable Corpora Matthew Garvey Snover, Doctor of Philosophy, 2010 Dissertation directed by: Professor Bonnie Dorr Department of Computer Science With thousands of languages in the world, and the increasing speed and quantity of information being distributed across the world, automatic translation between languages by comp...

Measuring the homogeneity and similarity of language corpora

Corpus-based methods are now dominant in Natural Language Processing (NLP) . Creating big corpora is no longer difficult and the technology to analyze them is growing faster, more robust and more accurate. However, when an NLP application performs well on one corpus, it is unclear whether this level of performance would be maintained on others. To make progress on these questions, we need metho...

Publication date: 2017